15 research outputs found

    Using linguistic knowledge in SMT

    Thesis (Ph. D. in Information Technology)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 153-162). In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using linguistic knowledge to address specific areas of the translation process, where translations would be hard to learn fully automatically. We present methods that use linguistic knowledge at various levels to improve statistical machine translation, focusing on Arabic-English translation as a case study. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation, which reduces the gap in the complexity of the morphology between Arabic and English. The second method addresses the issue of long-distance reordering in translation to account for the difference in the syntax of the two languages. In the third part, we show how additional local context information on the source side is incorporated, which helps reduce lexical ambiguity. Two methods are proposed for using binary decision trees to control the amount of context information introduced. These methods are successfully applied to the use of diacritized Arabic source in Arabic-to-English translation. The final method combines the outputs of an SMT system and a rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based MT system. By Rabih M. Zbib. Ph.D. in Information Technology.

    Automated information distribution in low bandwidth environments

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 1999. Includes bibliographical references (leaves 62-64). By Rabih M. Zbib. S.M.

    Statistical Machine Translation Features with Multitask Tensor Networks

    We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features. Comment: 11 pages (9 content + 2 references), 2 figures, accepted to ACL 2015 as a long paper.

    Résumé Parsing as Hierarchical Sequence Labeling: An Empirical Study

    Extracting information from résumés is typically formulated as a two-stage problem, where the document is first segmented into sections and then each section is processed individually to extract the target entities. Instead, we cast the whole problem as sequence labeling in two levels (lines and tokens) and study model architectures for solving both tasks simultaneously. We build high-quality résumé parsing corpora in English, French, Chinese, Spanish, German, Portuguese, and Swedish. Based on these corpora, we present experimental results that demonstrate the effectiveness of the proposed models for the information extraction task, outperforming approaches introduced in previous work. We conduct an ablation study of the proposed architectures. We also analyze both model performance and resource efficiency, and describe the trade-offs for model deployment in the context of a production environment. Comment: RecSys in HR'23: The 3rd Workshop on Recommender Systems for Human Resources, in conjunction with the 17th ACM Conference on Recommender Systems, September 18–22, 2023, Singapore, Singapore.

    Segmentation for English-to-Arabic statistical machine translation

    In this paper, we report on a set of initial results for English-to-Arabic Statistical Machine Translation (SMT). We show that morphological decomposition of the Arabic source is beneficial, especially for smaller-size corpora, and investigate different recombination techniques. We also report on the use of Factored Translation Models for English-to-Arabic translation.

    Decision Trees for Lexical Smoothing in Statistical Machine Translation

    We present a method for incorporating arbitrary context-informed word attributes into statistical machine translation by clustering attribute-qualified source words, and smoothing their word translation probabilities using binary decision trees. We describe two ways in which the decision trees are used in machine translation: by using the attribute-qualified source word clusters directly, or by using attribute-dependent lexical translation probabilities that are obtained from the trees, as a lexical smoothing feature in the decoder model. We present experiments using Arabic-to-English newswire data, and using Arabic diacritics and part-of-speech as source word attributes, and show that the proposed method improves on a state-of-the-art translation system.

    Improved morphological decomposition for Arabic broadcast news transcription

    In this paper, we show the progress for Arabic speech recognition by incorporating contextual information into the process of morphological decomposition. The new approach achieves lower out-of-vocabulary and word error rates when compared to our previous work, in which the morphological decomposition relies on word-level information only. We also describe how the vocalization procedure is improved to produce pronunciations for some dialect Arabic words. By using the new approach, we reduced the word error by 0.8% absolute (4.7% relative) when compared to the baseline approach. United States. Defense Advanced Research Projects Agency (DARPA). GALE program (Contract No. HR0011-06-C-0022)

    Fast and Robust Neural Network Joint Models for Statistical Machine Translation

    Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. Here, we present a novel formulation for a neural network joint model (NNJM), which augments the NNLM with a source context window. Our model is purely lexicalized and can be integrated into any MT decoder. We also present several variations of the NNJM which provide significant additive improvements. Although the model is quite simple, it yields strong empirical results. On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM. The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation. Additionally, we describe two novel techniques for overcoming the historically high cost of using NNLM-style models in MT decoding. These techniques speed up NNJM computation by a factor of 10,000x, making the model as fast as a standard back-off LM. This work was supported by DARPA/I2O Contract No. HR0011-12-C-0014 under the BOLT program (Approved fo